{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datalab: A unified audit to detect all kinds of issues in data and labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cleanlab offers a `Datalab` object that can identify various issues in your machine learning datasets, such as noisy labels, outliers, (near) duplicates, drift, and other types of problems common in real-world data. These data issues may negatively impact models if not addressed. `Datalab` utilizes *any* ML model you have already trained for your data to diagnose these issues, it only requires access to either: (probabilistic) predictions from your model or its learned representations of the data.\n",
    "\n",
    "\n",
    "**Overview of what we'll do in this tutorial:**\n",
    "\n",
    "- Compute out-of-sample predicted probabilities for a sample dataset using cross-validation.\n",
    "- Use `Datalab` to identify issues such as noisy labels, outliers, (near) duplicates, and other types of problems \n",
    "- View the issue summaries and other information about our sample dataset\n",
    "\n",
    "You can easily replace our demo dataset with your own image/text/tabular/audio/etc dataset, and then run the same code to discover what sort of issues lurk within it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Quickstart\n",
    "<br/>\n",
    "    \n",
    "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you also have some numeric `features` (or model embeddings of data)? Run the code below to examine your dataset for multiple types of issues.\n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```ipython3 \n",
    "from cleanlab import Datalab\n",
    "\n",
    "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
    "lab.find_issues(features=your_feature_matrix, pred_probs=your_pred_probs)\n",
    "\n",
    "lab.report()\n",
    "```\n",
    "   \n",
    "</div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install and import required dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Datalab` has additional dependencies that are not included in the standard installation of cleanlab.\n",
    "\n",
    "You can use pip to install all packages required for this tutorial as follows:\n",
    "\n",
    "```ipython3\n",
    "!pip install matplotlib\n",
    "!pip install \"cleanlab[datalab]\"\n",
    "\n",
    "# Make sure to install the version corresponding to this tutorial\n",
    "# E.g. if viewing master branch documentation:\n",
    "#     !pip install git+https://github.com/cleanlab/cleanlab.git\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Package installation (hidden on docs website).\n",
    "dependencies = [\"cleanlab\", \"matplotlib\", \"datasets\"]  # TODO: make sure this list is updated\n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
    "    %pip install cleanlab  # for colab\n",
    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
    "    %pip install $cmd\n",
    "else:\n",
    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
    "                         else dependency.split('<')[0] if '<' in dependency \n",
    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
    "    missing_dependencies = []\n",
    "    for dependency in dependencies_test:\n",
    "        try:\n",
    "            __import__(dependency)\n",
    "        except ImportError:\n",
    "            missing_dependencies.append(dependency)\n",
    "\n",
    "    if len(missing_dependencies) > 0:\n",
    "        print(\"Missing required dependencies:\")\n",
    "        print(*missing_dependencies, sep=\", \")\n",
    "        print(\"\\nPlease install them before running the rest of this notebook.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_predict\n",
    "\n",
    "from cleanlab import Datalab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Create and load the data (can skip these details)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll load a toy classification dataset for this tutorial. The dataset has two numerical features and a label column with three possible classes. Each example is classified as either: *low*, *mid* or *high*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>See the code for data generation. **(click to expand)**</summary>\n",
    "    \n",
    "```ipython3\n",
    "# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "from cleanlab.benchmarking.noise_generation import (\n",
    "    generate_noise_matrix_from_trace,\n",
    "    generate_noisy_labels,\n",
    ")\n",
    "\n",
    "SEED = 123\n",
    "np.random.seed(SEED)\n",
    "\n",
    "BINS = {\n",
    "    \"low\": [-np.inf, 3.3],\n",
    "    \"mid\": [3.3, 6.6],\n",
    "    \"high\": [6.6, +np.inf],\n",
    "}\n",
    "\n",
    "BINS_MAP = {\n",
    "    \"low\": 0,\n",
    "    \"mid\": 1,\n",
    "    \"high\": 2,\n",
    "}\n",
    "\n",
    "\n",
    "def create_data():\n",
    "\n",
    "    X = np.random.rand(250, 2) * 5\n",
    "    y = np.sum(X, axis=1)\n",
    "    # Map y to bins based on the BINS dict\n",
    "    y_bin = np.array([k for y_i in y for k, v in BINS.items() if v[0] <= y_i < v[1]])\n",
    "    y_bin_idx = np.array([BINS_MAP[k] for k in y_bin])\n",
    "\n",
    "    # Split into train and test\n",
    "    X_train, X_test, y_train, y_test, y_train_idx, y_test_idx = train_test_split(\n",
    "        X, y_bin, y_bin_idx, test_size=0.5, random_state=SEED\n",
    "    )\n",
    "\n",
    "    # Add several (5) out-of-distribution points. Sliding them along the decision boundaries\n",
    "    # to make them look like they are out-of-frame\n",
    "    X_out = np.array(\n",
    "        [\n",
    "            [-1.5, 3.0],\n",
    "            [-1.75, 6.5],\n",
    "            [1.5, 7.2],\n",
    "            [2.5, -2.0],\n",
    "            [5.5, 7.0],\n",
    "        ]\n",
    "    )\n",
    "    # Add a near duplicate point to the last outlier, with some tiny noise added\n",
    "    near_duplicate = X_out[-1:] + np.random.rand(1, 2) * 1e-6\n",
    "    X_out = np.concatenate([X_out, near_duplicate])\n",
    "\n",
    "    y_out = np.sum(X_out, axis=1)\n",
    "    y_out_bin = np.array([k for y_i in y_out for k, v in BINS.items() if v[0] <= y_i < v[1]])\n",
    "    y_out_bin_idx = np.array([BINS_MAP[k] for k in y_out_bin])\n",
    "\n",
    "    # Add to train\n",
    "    X_train = np.concatenate([X_train, X_out])\n",
    "    y_train = np.concatenate([y_train, y_out])\n",
    "    y_train_idx = np.concatenate([y_train_idx, y_out_bin_idx])\n",
    "\n",
    "    # Add an exact duplicate example to the training set\n",
    "    exact_duplicate_idx = np.random.randint(0, len(X_train))\n",
    "    X_duplicate = X_train[exact_duplicate_idx, None]\n",
    "    y_duplicate = y_train[exact_duplicate_idx, None]\n",
    "    y_duplicate_idx = y_train_idx[exact_duplicate_idx, None]\n",
    "\n",
    "    # Add to train\n",
    "    X_train = np.concatenate([X_train, X_duplicate])\n",
    "    y_train = np.concatenate([y_train, y_duplicate])\n",
    "    y_train_idx = np.concatenate([y_train_idx, y_duplicate_idx])\n",
    "\n",
    "    py = np.bincount(y_train_idx) / float(len(y_train_idx))\n",
    "    m = len(BINS)\n",
    "\n",
    "    noise_matrix = generate_noise_matrix_from_trace(\n",
    "        m,\n",
    "        trace=0.9 * m,\n",
    "        py=py,\n",
    "        valid_noise_matrix=True,\n",
    "        seed=SEED,\n",
    "    )\n",
    "\n",
    "    noisy_labels_idx = generate_noisy_labels(y_train_idx, noise_matrix)\n",
    "    noisy_labels = np.array([list(BINS_MAP.keys())[i] for i in noisy_labels_idx])\n",
    "\n",
    "    return X_train, y_train_idx, noisy_labels, noisy_labels_idx, X_out, X_duplicate\n",
    "```\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from cleanlab.benchmarking.noise_generation import (\n",
    "    generate_noise_matrix_from_trace,\n",
    "    generate_noisy_labels,\n",
    ")\n",
    "\n",
    "SEED = 123\n",
    "np.random.seed(SEED)\n",
    "\n",
    "BINS = {\n",
    "    \"low\": [-np.inf, 3.3],\n",
    "    \"mid\": [3.3, 6.6],\n",
    "    \"high\": [6.6, +np.inf],\n",
    "}\n",
    "\n",
    "BINS_MAP = {\n",
    "    \"low\": 0,\n",
    "    \"mid\": 1,\n",
    "    \"high\": 2,\n",
    "}\n",
    "\n",
    "\n",
    "def create_data():\n",
    "\n",
    "    X = np.random.rand(250, 2) * 5\n",
    "    y = np.sum(X, axis=1)\n",
    "    # Map y to bins based on the BINS dict\n",
    "    y_bin = np.array([k for y_i in y for k, v in BINS.items() if v[0] <= y_i < v[1]])\n",
    "    y_bin_idx = np.array([BINS_MAP[k] for k in y_bin])\n",
    "\n",
    "    # Split into train and test\n",
    "    X_train, X_test, y_train, y_test, y_train_idx, y_test_idx = train_test_split(\n",
    "        X, y_bin, y_bin_idx, test_size=0.5, random_state=SEED\n",
    "    )\n",
    "\n",
    "    # Add several (5) out-of-distribution points. Sliding them along the decision boundaries\n",
    "    # to make them look like they are out-of-frame\n",
    "    X_out = np.array(\n",
    "        [\n",
    "            [-1.5, 3.0],\n",
    "            [-1.75, 6.5],\n",
    "            [1.5, 7.2],\n",
    "            [2.5, -2.0],\n",
    "            [5.5, 7.0],\n",
    "        ]\n",
    "    )\n",
    "    # Add a near duplicate point to the last outlier, with some tiny noise added\n",
    "    near_duplicate = X_out[-1:] + np.random.rand(1, 2) * 1e-6\n",
    "    X_out = np.concatenate([X_out, near_duplicate])\n",
    "\n",
    "    y_out = np.sum(X_out, axis=1)\n",
    "    y_out_bin = np.array([k for y_i in y_out for k, v in BINS.items() if v[0] <= y_i < v[1]])\n",
    "    y_out_bin_idx = np.array([BINS_MAP[k] for k in y_out_bin])\n",
    "\n",
    "    # Add to train\n",
    "    X_train = np.concatenate([X_train, X_out])\n",
    "    y_train = np.concatenate([y_train, y_out])\n",
    "    y_train_idx = np.concatenate([y_train_idx, y_out_bin_idx])\n",
    "\n",
    "    # Add an exact duplicate example to the training set\n",
    "    exact_duplicate_idx = np.random.randint(0, len(X_train))\n",
    "    X_duplicate = X_train[exact_duplicate_idx, None]\n",
    "    y_duplicate = y_train[exact_duplicate_idx, None]\n",
    "    y_duplicate_idx = y_train_idx[exact_duplicate_idx, None]\n",
    "\n",
    "    # Add to train\n",
    "    X_train = np.concatenate([X_train, X_duplicate])\n",
    "    y_train = np.concatenate([y_train, y_duplicate])\n",
    "    y_train_idx = np.concatenate([y_train_idx, y_duplicate_idx])\n",
    "\n",
    "    py = np.bincount(y_train_idx) / float(len(y_train_idx))\n",
    "    m = len(BINS)\n",
    "\n",
    "    noise_matrix = generate_noise_matrix_from_trace(\n",
    "        m,\n",
    "        trace=0.9 * m,\n",
    "        py=py,\n",
    "        valid_noise_matrix=True,\n",
    "        seed=SEED,\n",
    "    )\n",
    "\n",
    "    noisy_labels_idx = generate_noisy_labels(y_train_idx, noise_matrix)\n",
    "    noisy_labels = np.array([list(BINS_MAP.keys())[i] for i in noisy_labels_idx])\n",
    "    # Assign few datapoints to rare class\n",
    "    random_idx = np.random.randint(0, X_train.shape[0], 3)\n",
    "    noisy_labels[random_idx] = \"max\"\n",
    "    noisy_labels_idx[random_idx] = np.max(y_bin_idx) + 1\n",
    "    \n",
    "\n",
    "    return X_train, y_train_idx, noisy_labels, noisy_labels_idx, X_out, X_duplicate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, y_train_idx, noisy_labels, noisy_labels_idx, X_out, X_duplicate = create_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We make a scatter plot of the features, with a color corresponding to the observed labels. Incorrect given labels are highlighted in red if they do not match the true label, outliers highlighted with an a black cross, and duplicates highlighted with a cyan cross."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>See the code to visualize the data. **(click to expand)**</summary>\n",
    "    \n",
    "```ipython3\n",
    "# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def plot_data(X_train, y_train_idx, noisy_labels_idx, X_out, X_duplicate):\n",
    "    # Plot data with clean labels and noisy labels, use BINS_MAP for the legend\n",
    "    fig, ax = plt.subplots(figsize=(8, 6.5))\n",
    "        \n",
    "    low = ax.scatter(X_train[noisy_labels_idx == 0, 0], X_train[noisy_labels_idx == 0, 1], label=\"low\")\n",
    "    mid = ax.scatter(X_train[noisy_labels_idx == 1, 0], X_train[noisy_labels_idx == 1, 1], label=\"mid\")\n",
    "    high = ax.scatter(X_train[noisy_labels_idx == 2, 0], X_train[noisy_labels_idx == 2, 1], label=\"high\")\n",
    "    \n",
    "    ax.set_title(\"Noisy labels\")\n",
    "    ax.set_xlabel(r\"$x_1$\", fontsize=16)\n",
    "    ax.set_ylabel(r\"$x_2$\", fontsize=16)\n",
    "\n",
    "    # Plot true boundaries (x+y=3.3, x+y=6.6)\n",
    "    ax.set_xlim(-3.5, 9.0)\n",
    "    ax.set_ylim(-3.5, 9.0)\n",
    "    ax.plot([-0.7, 4.0], [4.0, -0.7], color=\"k\", linestyle=\"--\", alpha=0.5)\n",
    "    ax.plot([-0.7, 7.3], [7.3, -0.7], color=\"k\", linestyle=\"--\", alpha=0.5)\n",
    "\n",
    "    # Draw red circles around the points that are misclassified (i.e. the points that are in the wrong bin)\n",
    "    for i, (X, y) in enumerate(zip([X_train, X_train], [y_train_idx, noisy_labels_idx])):\n",
    "        for j, (k, v) in enumerate(BINS_MAP.items()):\n",
    "            label_err = ax.scatter(\n",
    "                X[(y == v) & (y != y_train_idx), 0],\n",
    "                X[(y == v) & (y != y_train_idx), 1],\n",
    "                s=180,\n",
    "                marker=\"o\",\n",
    "                facecolor=\"none\",\n",
    "                edgecolors=\"red\",\n",
    "                linewidths=2.5,\n",
    "                alpha=0.5,\n",
    "                label=\"Label error\",\n",
    "            )\n",
    "\n",
    "\n",
    "    outlier = ax.scatter(X_out[:, 0], X_out[:, 1], color=\"k\", marker=\"x\", s=100, linewidth=2, label=\"Outlier\")\n",
    "\n",
    "    # Plot the exact duplicate\n",
    "    dups = ax.scatter(\n",
    "        X_duplicate[:, 0],\n",
    "        X_duplicate[:, 1],\n",
    "        color=\"c\",\n",
    "        marker=\"x\",\n",
    "        s=100,\n",
    "        linewidth=2,\n",
    "        label=\"Duplicates\",\n",
    "    )\n",
    "    \n",
    "    first_legend = ax.legend(handles=[low, mid, high], loc=[0.75, 0.7], title=\"Given Class Label\", alignment=\"left\", title_fontproperties={\"weight\":\"semibold\"})\n",
    "    second_legend = ax.legend(handles=[label_err, outlier, dups], loc=[0.75, 0.45], title=\"Type of Issue\", alignment=\"left\", title_fontproperties={\"weight\":\"semibold\"})\n",
    "    \n",
    "    ax = plt.gca().add_artist(first_legend)\n",
    "    ax = plt.gca().add_artist(second_legend)\n",
    "    plt.tight_layout()\n",
    "```\n",
    "    \n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def plot_data(X_train, y_train_idx, noisy_labels_idx, X_out, X_duplicate):\n",
    "    # Plot data with clean labels and noisy labels, use BINS_MAP for the legend\n",
    "    fig, ax = plt.subplots(figsize=(6, 4))\n",
    "        \n",
    "    low = ax.scatter(X_train[noisy_labels_idx == 0, 0], X_train[noisy_labels_idx == 0, 1], label=\"low\")\n",
    "    mid = ax.scatter(X_train[noisy_labels_idx == 1, 0], X_train[noisy_labels_idx == 1, 1], label=\"mid\")\n",
    "    high = ax.scatter(X_train[noisy_labels_idx == 2, 0], X_train[noisy_labels_idx == 2, 1], label=\"high\")\n",
    "    \n",
    "    ax.set_title(\"Noisy labels\")\n",
    "    ax.set_xlabel(r\"$x_1$\", fontsize=16)\n",
    "    ax.set_ylabel(r\"$x_2$\", fontsize=16)\n",
    "\n",
    "    # Plot true boundaries (x+y=3.3, x+y=6.6)\n",
    "    ax.set_xlim(-2.5, 8.5)\n",
    "    ax.set_ylim(-3.5, 9.0)\n",
    "    ax.plot([-0.7, 4.0], [4.0, -0.7], color=\"k\", linestyle=\"--\", alpha=0.5)\n",
    "    ax.plot([-0.7, 7.3], [7.3, -0.7], color=\"k\", linestyle=\"--\", alpha=0.5)\n",
    "\n",
    "    # Draw red circles around the points that are misclassified (i.e. the points that are in the wrong bin)\n",
    "    for i, (X, y) in enumerate(zip([X_train, X_train], [y_train_idx, noisy_labels_idx])):\n",
    "        for j, (k, v) in enumerate(BINS_MAP.items()):\n",
    "            label_err = ax.scatter(\n",
    "                X[(y == v) & (y != y_train_idx), 0],\n",
    "                X[(y == v) & (y != y_train_idx), 1],\n",
    "                s=180,\n",
    "                marker=\"o\",\n",
    "                facecolor=\"none\",\n",
    "                edgecolors=\"red\",\n",
    "                linewidths=2.5,\n",
    "                alpha=0.5,\n",
    "                label=\"Label error\",\n",
    "            )\n",
    "\n",
    "\n",
    "    outlier = ax.scatter(X_out[:, 0], X_out[:, 1], color=\"k\", marker=\"x\", s=100, linewidth=2, label=\"Outlier\")\n",
    "\n",
    "    # Plot the exact duplicate\n",
    "    dups = ax.scatter(\n",
    "        X_duplicate[:, 0],\n",
    "        X_duplicate[:, 1],\n",
    "        color=\"c\",\n",
    "        marker=\"x\",\n",
    "        s=100,\n",
    "        linewidth=2,\n",
    "        label=\"Duplicates\",\n",
    "    )\n",
    "    \n",
    "    title_fontproperties = {\"weight\":\"semibold\", \"size\": 8}\n",
    "    first_legend = ax.legend(handles=[low, mid, high], loc=[0.76, 0.7], title=\"Given Class Label\", alignment=\"left\", title_fontproperties=title_fontproperties, fontsize=8, markerscale=0.5)\n",
    "    second_legend = ax.legend(handles=[label_err, outlier, dups], loc=[0.76, 0.46], title=\"Type of Issue\", alignment=\"left\", title_fontproperties=title_fontproperties, fontsize=8, markerscale=0.5)\n",
    "    \n",
    "    ax = plt.gca().add_artist(first_legend)\n",
    "    ax = plt.gca().add_artist(second_legend)\n",
    "    plt.tight_layout()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_data(X_train, y_train_idx, noisy_labels_idx, X_out, X_duplicate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In real-world scenarios, you won't know the true labels or the distribution of the features, so we won't use these in this tutorial, except for evaluation purposes.\n",
    "\n",
    "\n",
    "\n",
    "`Datalab` has several ways of loading the data.\n",
    "In this case, we'll simply wrap the training features and noisy labels in a dictionary so that we can pass it to `Datalab`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = {\"X\": X_train, \"y\": noisy_labels}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other supported data formats for `Datalab` include: [HuggingFace Datasets](https://huggingface.co/docs/datasets/index) and [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). `Datalab` works across most data modalities (image, text, tabular, audio, etc). It is intended to find issues that commonly occur in datasets for which you have trained a supervised ML model, regardless of the type of data.\n",
    "\n",
    "Currently, pandas DataFrames that contain categorical columns might cause some issues when instantiating the `Datalab` object, so it is recommended to ensure that your DataFrame does not contain any categorical columns, or use other data formats (eg. python dictionary, HuggingFace Datasets) to pass in your data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Get out-of-sample predicted probabilities from a classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To detect certain types of issues in classification data (e.g. label errors), `Datalab` relies on predicted class probabilities from a trained model. Ideally, the prediction for each example should be out-of-sample (to avoid overfitting), coming from a copy of the model that was not trained on this example. \n",
    "\n",
    "This tutorial uses a simple logistic regression model \n",
    "and the `cross_val_predict()` function from scikit-learn to generate out-of-sample predicted class probabilities for every example in the training set. You can replace this with *any* other classifier model and train it with cross-validation to get out-of-sample predictions.\n",
    "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = LogisticRegression()\n",
    "pred_probs = cross_val_predict(\n",
    "    estimator=model, X=data[\"X\"], y=data[\"y\"], cv=5, method=\"predict_proba\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Use Datalab to find issues in the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a `Datalab` object from the dataset, also providing the name of the label column in the dataset. Only instantiate one `Datalab` object per dataset, and note that only classification datasets are supported for now.\n",
    "\n",
    "All that is need to audit your data is to call `find_issues()`.\n",
    "This method accepts various inputs like: predicted class probabilities, numeric feature representations of the data. The more information you provide here, the more thoroughly `Datalab` will audit your data! Note that `features` should be some numeric representation of each example, either obtained through preprocessing transformation of your raw data or embeddings from a (pre)trained model. In this case, our data is already entirely numeric so we just provide the features directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lab = Datalab(data, label_name=\"y\")\n",
    "lab.find_issues(pred_probs=pred_probs, features=data[\"X\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's review the results of this audit using `report()`.\n",
    "This provides a high-level summary of each type of issue found in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lab.report()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Learn more about the issues in your dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Datalab detects all sorts of issues in a dataset and what to do with the findings will vary case-by-case. For automated improvement of a dataset via best practices to handle auto-detected issues, try [Cleanlab Studio](https://cleanlab.ai/?utm_source=internal&utm_medium=blog&utm_campaign=clostostudio).\n",
    "\n",
    "To conceptually understand how each type of issue is defined and what it means if detected in your data, check out the [Issue Type Descriptions](../../cleanlab/datalab/guide/issue_type_description.html) page. The [Datalab Issue Types](https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html) page also lists additional types of issues that `Datalab.find_issues()` can detect, as well as optional parameters you can specify for greater control over how your data are checked.\n",
    "\n",
    "Datalab offers several methods to understand more details about a particular issue in your dataset.\n",
    "The `get_issue_summary()` method fetches summary statistics regarding how severe each type of issue is overall across the whole dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lab.get_issue_summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the returned summary DataFrame: LOWER `score` values indicate types of issues that are MORE severe *overall* across the dataset (lower-quality data in terms of this issue), HIGHER `num_issues` values indicate types of issues that are MORE severe *overall* across the dataset (more datapoints appear to exhibit this issue).\n",
    "\n",
    "We can also only request the summary for a particular type of issue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lab.get_issue_summary(\"label\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `get_issues()` method returns information for each *individual example* in the dataset including: whether or not it is plagued by this issue (Boolean), as well as a *quality score* (numeric value betweeen 0 to 1) quantifying how severe this issue appears to be for this particular example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lab.get_issues().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each example receives a separate *quality score* for each issue type (eg. `outlier_score` is the *quality score* for the `outlier` issue type, quantifying *how typical* each datapoint appears to be). LOWER scores indicate MORE severe instances of the issue, so the most-concerning datapoints have the lowest quality scores. Sort by these scores to see the most-concerning examples in your dataset for each type of issue. The quality scores are directly comparable between examples/datasets, but not across different issue types.\n",
    "\n",
    "Similar to above, we can pass the type of issue as a argument to `get_issues()` to get the information for one particular type of issue.\n",
    "As an example, let's see the examples identified as having the most severe *label* issues:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "examples_w_issue = (\n",
    "    lab.get_issues(\"label\")\n",
    "    .query(\"is_label_issue\")\n",
    "    .sort_values(\"label_score\")\n",
    ")\n",
    "\n",
    "examples_w_issue.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inspecting the labels for some of these top-ranked examples, we find their given label was indeed incorrect."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Get additional information \n",
    "\n",
    "Miscellaneous additional information (statistics, intermediate results, etc) related to a particular issue type can be accessed via `get_info(issue_name)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "label_issues_info = lab.get_info(\"label\")\n",
    "label_issues_info[\"classes_by_label_quality\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This portion of the info shows overall label quality summaries of all examples annotated as a particular class (e.g. the `Label Issues` column is the estimated number of examples labeled as this class that should actually have a different label).\n",
    "To learn more about this, see the documentation for the [cleanlab.dataset.rank_classes_by_label_quality](../../cleanlab/dataset.html#cleanlab.dataset.rank_classes_by_label_quality)\n",
    "method.\n",
    "\n",
    "You can view all sorts of information regarding your dataset using the `get_info()` method with no arguments passed. This is not printed here as it returns a huge dictionary but feel free to check it out yourself! Don't worry if you don't understand all of the miscellaneous information in this `info` dictionary, none of it is critical to diagnose the issues in your dataset. Understanding miscellaneous info may require reading the documentation of the miscellaneous cleanlab functions which computed it.\n",
    "\n",
    "#### Near duplicate issues \n",
    "\n",
    "Let's also inspect the examples flagged as (near) duplicates.\n",
    "For each such example, the `near_duplicate_sets` column below indicates *which* other examples in the dataset are highly similar to it (this value is empty for examples not flagged as nearly duplicated). The `near_duplicate_score` quantifies *how similar* each example is to its nearest neighbor in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lab.get_issues(\"near_duplicate\").query(\"is_near_duplicate_issue\").sort_values(\"near_duplicate_score\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Learn more about handling near duplicates detected in a dataset from [the FAQ](../faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?). \n",
    "\n",
    "Other issues detected in this tutorial dataset include **outliers** and **class imbalance**, see the [Issue Type Descriptions](../../cleanlab/datalab/guide/issue_type_description.html) for more information. `Datalab` makes it very easy to check your datasets for all sorts of issues that are important to deal with for training robust models. The inputs it uses to detect issues can come from *any* model you have trained (the better your model, the more accurate the issue detection will be).\n",
    "\n",
    "To learn more, check out this [example notebook](https://github.com/cleanlab/examples/blob/master/datalab_image_classification/datalab.ipynb) (demonstrates Datalab applied to a real dataset) and the [advanced Datalab tutorial](datalab_advanced.html) (demonstrates configuration and customization options to exert greater control)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "def precision_at_k(predicted_indices, true_indices, k):\n",
    "    return len(set(predicted_indices[:k]).intersection(set(true_indices))) / k\n",
    "\n",
    "def recall_at_k(predicted_indices, true_indices, k):\n",
    "    return len(set(predicted_indices[:k]).intersection(set(true_indices))) / len(true_indices)\n",
    "\n",
    "def jaccard_similarity(l1, l2):\n",
    "    s1 = set(l1)\n",
    "    s2 = set(l2)\n",
    "    intersect_set = s1.intersection(s2)\n",
    "    union_set = s1.union(s2)\n",
    "    if len(intersect_set) == 0:\n",
    "        return 0\n",
    "    return len(intersect_set) / len(union_set)\n",
    "\n",
    "label_issues = lab.get_issues(\"label\")\n",
    "predicted_label_issues_indices = (\n",
    "    label_issues.query(\"is_label_issue\").sort_values(\"label_score\").index.to_list()\n",
    ")\n",
    "predicted_label_issues_indices_by_score = (\n",
    "    label_issues.sort_values(\"label_score\").index.to_list()\n",
    ")\n",
    "label_issue_indices = np.where(y_train_idx != noisy_labels_idx)[0]\n",
    "\n",
    "label_quality_scores = label_issues[\"label_score\"].tolist()\n",
    "Z = (y_train_idx == noisy_labels_idx).astype(float).tolist()\n",
    "\n",
    "predicted_outlier_issues_indices = (\n",
    "    lab.get_issues(\"outlier\").query(\"is_outlier_issue\").index.to_list()\n",
    ")\n",
    "outlier_issue_indices = list(range(125, 130+1))\n",
    "exact_duplicate_idx = [index for index, elem in enumerate(X_train) if (elem == X_duplicate).all()][0]\n",
    "if exact_duplicate_idx >= 125: # if the random index selected to create a duplicate >= 125, then the last point is also an outlier\n",
    "    outlier_issue_indices.append(131)\n",
    "\n",
    "predicted_duplicate_issues_indices = (\n",
    "    lab.get_issues(\"near_duplicate\").query(\"is_near_duplicate_issue\").index.tolist()\n",
    ")\n",
    "duplicate_issue_indices = [exact_duplicate_idx, 129, 130, 131]\n",
    "\n",
    "k = len(label_issue_indices)\n",
    "assert precision_at_k(predicted_label_issues_indices, label_issue_indices, k) >= 0.75\n",
    "assert recall_at_k(predicted_label_issues_indices, label_issue_indices, k) >= 0.75\n",
    "assert precision_at_k(predicted_label_issues_indices_by_score, label_issue_indices, k) == 1.0\n",
    "assert recall_at_k(predicted_label_issues_indices_by_score, label_issue_indices, k) == 1.0\n",
    "assert roc_auc_score(Z, label_quality_scores) > 0.9\n",
    "\n",
    "assert jaccard_similarity(predicted_outlier_issues_indices, outlier_issue_indices) > 0.9\n",
    "assert jaccard_similarity(predicted_duplicate_issues_indices, duplicate_issue_indices) > 0.9\n",
    "\n",
    "expected_issue_types = set([\"label\", \"outlier\", \"near_duplicate\", \"class_imbalance\"])\n",
    "detected_issue_types = set(lab.get_issue_summary()[lab.get_issue_summary()[\"num_issues\"] > 0][\"issue_type\"])\n",
    "assert detected_issue_types == expected_issue_types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spending too much time on data quality?\n",
    "\n",
    "Using this open-source package effectively can require significant ML expertise and experimentation, plus handling detected data issues can be cumbersome.\n",
    "\n",
    "That’s why we built [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) -- an automated platform to find **and fix** issues in your dataset, 100x faster and more accurately.  Cleanlab Studio automatically runs optimized data quality algorithms from this package on top of cutting-edge AutoML & Foundation models fit to your data, and helps you fix detected issues via a smart data correction interface. [Try it](https://cleanlab.ai/) for free!\n",
    "\n",
    "<p align=\"center\">\n",
    "  <img src=\"https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/ml-with-cleanlab-studio.png\" alt=\"The modern AI pipeline automated with Cleanlab Studio\">\n",
    "</p>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  },
  "vscode": {
   "interpreter": {
    "hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
